!pip install --user pyclustertend
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import statsmodels.api as sm
from statsmodels.stats.proportion import proportions_ztest
import scipy.stats as stats
The dataset "insurance.csv" contains data about medical insurance claims.
Columns:
1. age: age of primary beneficiary
2. sex: insurance contractor gender, female, male
3. bmi: body mass index (kg/m^2), an objective measure of body weight relative to height; the ideal range is 18.5 to 24.9
4. children: Number of children covered by health insurance / Number of dependents
5. smoker: Smoking status of the individual
6. region: the beneficiary's residential area in the US: northeast, southeast, southwest, northwest
7. charges: Individual medical costs billed by health insurance
df = pd.read_csv('https://raw.githubusercontent.com/Nidhish-Krishna/ds_basics/main/insurance.csv')
df.info()
df.describe()
df.head()
sns.histplot(df, x='charges', hue='smoker', kde=True, hue_order=['yes', 'no'], alpha=.7)
plt.title('Distribution of charges in relation to smoking')
plt.show()
sns.histplot(data=df, x='bmi', stat='density', kde=True)
plt.show()
sns.countplot(x='sex', data=df)
plt.title('Person by Gender', fontsize='16', fontweight='bold')
plt.xlabel('Gender', fontsize='14')
plt.ylabel('Total Persons', fontsize='14')
plt.show()
sns.countplot(x='region', data=df)
plt.title('Person by Region', fontsize='16', fontweight='bold')
plt.xlabel('Region Name', fontsize='14')
plt.ylabel('Total Persons', fontsize='14')
plt.show()
sns.countplot(x='smoker', data=df)
plt.title('Person by Smoking Status', fontsize='16', fontweight='bold')
plt.xlabel('Smoking Status', fontsize='14')
plt.ylabel('Total Persons', fontsize='14')
plt.show()
sns.countplot(x='children', data=df)
plt.title('Person by Having Children', fontsize='16', fontweight='bold')
plt.xlabel('Number of Children', fontsize='14')
plt.ylabel('Total Persons', fontsize='14')
plt.show()
sns.histplot(df['age'], color = 'black')
plt.title('Person Age Distributions', fontsize='16', fontweight='bold')
plt.xlabel('Age', fontsize='14')
plt.ylabel('Total Persons', fontsize='14')
plt.show()
sns.histplot(df['bmi'], color = 'red')
plt.title('Persons BMI Distributions', fontsize='16', fontweight='bold')
plt.xlabel('BMI', fontsize='14')
plt.ylabel('Total Persons', fontsize='14')
plt.show()
sns.histplot(df['charges'], color = 'gray')
plt.title('Distribution of Expense', fontsize='16', fontweight='bold')
plt.xlabel('Expenses', fontsize='14')
plt.ylabel('Total Persons', fontsize='14')
plt.show()
px.scatter(df,
x = 'age',
y = 'charges',
marginal_y = 'violin',
trendline='ols')
px.scatter(df,
y = "charges",
x = "bmi", trendline='ols')
sns.catplot(x="sex", y="charges", data=df, kind="bar")
plt.show()
sns.catplot(x="children", y="charges", data=df, kind="bar")
plt.show()
px.scatter(df,
x='charges',
y='bmi',
size='age',
color= 'smoker',
hover_name = 'charges',
size_max = 12)
px.bar_polar(df,
r='charges',
theta='region',
color='sex',
template = 'plotly_dark')
One of the aims of this study is to see whether smoking status is associated with insurance charges.
Start by selecting just the smoking status and medical charges columns. There are 274 smokers and 1,064 non-smokers in the dataset.
smoking_and_charges = df[['smoker', 'charges']]
smoking_and_charges
smoker = smoking_and_charges['charges'] [smoking_and_charges['smoker'] == 'yes']
smoker
non_smoker = smoking_and_charges['charges'] [smoking_and_charges['smoker'] == 'no']
non_smoker
smoking_and_charges.hist(by ='smoker')
smoker.hist(histtype='stepfilled', alpha=.5, bins=15) # default number of bins = 10
non_smoker.hist(histtype='stepfilled', alpha=.5, color=sns.desaturate("green", .75), bins=15)
plt.xlabel('Charges',fontsize=15)
plt.ylabel('Number of People',fontsize=15)
plt.show()
The distribution of medical charges for patients who do not smoke appears shifted to the left of the distribution for patients who smoke. On average, the charges of non-smoking patients seem lower than those of smokers.
We can try to answer this question by a test of hypotheses. The chance model that we will test says that there is no underlying difference; the distributions in the samples are different just due to chance. Formally, this is the null hypothesis.
Null hypothesis: In the population, the distribution of charges of medical costs is the same for patients who don't smoke and for patients who are smokers. The difference in the sample is due to chance.
Alternative hypothesis: In the population, the medical charges of the patients who smoke have a higher medical expense, on average, than the patients who are non-smokers.
Test Statistic:
The alternative hypothesis compares the average charges of the two groups and says that the average charge for the patients who smoke is greater. Therefore it is reasonable for us to use the difference between the two group means as our statistic.
means_table = smoking_and_charges.groupby('smoker').mean()
means_table
observed_difference = means_table.loc['yes', 'charges'] - means_table.loc['no', 'charges']
To see how the statistic should vary under the null hypothesis, we have to figure out how to simulate the statistic under that hypothesis. We employ a method based on random permutations to do that.
If there were no difference between the two distributions in the underlying population, then whether a medical charge has the label 'yes' or 'no' with respect to smoking status should make no difference to the average. The idea, then, is to shuffle all the medical charges randomly among the individuals. This is called random permutation.
We take the difference of the two new group means: the mean of the shuffled charges assigned to the smokers and the mean of the shuffled charges assigned to the non-smokers. This is a simulated value of the test statistic under the null hypothesis.
smoking_and_charges
There are 1,338 rows in the table. To shuffle all the medical charges, we will draw a random sample of 1,338 rows without replacement. Then the sample will include all the rows of the table, in random order.
shuffled = smoking_and_charges.sample(1338,replace = False)
shuffled
shuffled_weights = shuffled['charges']
type(shuffled_weights)
original_and_shuffled= smoking_and_charges.assign(shuffled_weights=shuffled_weights.values )
original_and_shuffled
Each person now has a random charge assigned to them. If the null hypothesis is true, all these random arrangements are equally likely. Let's see how different the average charges are in the two randomly formed groups.
all_group_means= original_and_shuffled.groupby('smoker').mean()
all_group_means
difference = all_group_means.loc['yes', 'shuffled_weights'] - all_group_means.loc['no', 'shuffled_weights']
difference
To get a sense of the variability, let's simulate the difference many times.
smoking_and_charges = df[['smoker', 'charges']]
shuffled = smoking_and_charges.sample(1338,replace = False)
shuffled_weights = shuffled['charges']
original_and_shuffled = smoking_and_charges.assign(shuffled_weights=shuffled_weights.values )
all_group_means= original_and_shuffled.groupby('smoker').mean()
difference = all_group_means.loc['yes', 'shuffled_weights'] - all_group_means.loc['no', 'shuffled_weights']
difference
Tests based on random permutations of the data are called permutation tests.
Simulate the test statistic – the difference between the averages of the two groups – many times and collect the differences in an array.
differences = np.zeros(5000)
for i in np.arange(5000):
    shuffled = smoking_and_charges.sample(1338, replace=False)
    original_and_shuffled = smoking_and_charges.assign(shuffled_weights=shuffled['charges'].values)
    all_group_means = original_and_shuffled.groupby('smoker').mean()
    differences[i] = all_group_means.loc['yes', 'shuffled_weights'] - all_group_means.loc['no', 'shuffled_weights']
differences
plt.style.use('fivethirtyeight')
differences_df = pd.DataFrame(differences)
differences_df
differences_df.hist(bins = np.arange(-2500,2500,50))
plt.title('Prediction Under Null Hypotheses');
plt.xlabel('Differences between Group Averages',fontsize=15)
plt.ylabel('Frequency', fontsize=15);
print('Observed Difference:', observed_difference)
Notice how the distribution is centered around 0. This makes sense, because under the null hypothesis the two groups should have roughly the same average. Therefore the difference between the group averages should be around 0.
The observed difference in the original sample is about 23615.9635, which doesn't even appear on the horizontal scale of the histogram. The observed value of the statistic and the predicted behavior of the statistic under the null hypothesis are inconsistent.
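The visual comparison can be backed by a number: the empirical p-value is the proportion of simulated differences at least as large as the observed one. A minimal, self-contained sketch (synthetic groups stand in for the smoker/non-smoker charges; with the notebook's `differences` array this reduces to `np.mean(differences >= observed_difference)`):

```python
import numpy as np

def permutation_p_value(group_a, group_b, n_reps=2000, seed=0):
    """One-sided empirical p-value for mean(group_a) - mean(group_b)."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([group_a, group_b])
    observed = group_a.mean() - group_b.mean()
    n_a = len(group_a)
    diffs = np.empty(n_reps)
    for i in range(n_reps):
        perm = rng.permutation(pooled)  # shuffle all values among the two groups
        diffs[i] = perm[:n_a].mean() - perm[n_a:].mean()
    # proportion of simulated differences at least as extreme as the observed one
    return np.mean(diffs >= observed)

# toy example: two clearly separated groups stand in for the two charge samples
rng = np.random.default_rng(1)
separated = permutation_p_value(rng.normal(10, 1, 200), rng.normal(0, 1, 200))
print(separated)  # 0.0: no shuffled difference comes close to the observed one
```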
The analysis above is now repeated using a t-test.
Aim: To prove that the charges of the smokers are greater than that of the non-smokers
Null hypothesis: the average charges of smokers are less than or equal to those of non-smokers.
Alternative hypothesis: the average charges of smokers are greater than those of non-smokers.
smoker=df.loc[df.smoker=="yes"]
smoker.head()
nonsmoker=df.loc[df.smoker=='no']
nonsmoker.head()
#print(smoker.count(), "\n", nonsmoker.count())
nonsmoker = nonsmoker[-274:]  # subsample to match the 274 smokers (optional; ttest_ind handles unequal sizes)
#print(nonsmoker)
charges_yes = smoker.charges
charges_no = nonsmoker.charges
print('Average Cost charged to Insurance for smoker is {:.3f}'.format(charges_yes.mean()))
print('Average Cost charged to Insurance for non-smoker is {:.3f}'.format(charges_no.mean()))
sns.histplot(charges_yes, color='green', kde=True)
sns.histplot(charges_no, color='red', kde=True)
plt.show()
The green histogram shows the charges of smokers; the red histogram shows the charges of non-smokers.
sns.boxplot(x=df.charges,y=df.smoker,data=df).set(title="Box Plot: Smoker vs Charges")
alpha = 0.05  # significance level
t_statistic1, p_value1 = stats.ttest_ind(charges_yes, charges_no)
p_value_onetail = p_value1 / 2  # one-tailed p-value (valid because t_statistic1 > 0)
print("Test statistic = {}, P-value = {}, One-tail P-value = {}".format(t_statistic1, p_value1, p_value_onetail))
if p_value_onetail < alpha:
    print("Conclusion: since the one-tail P-value {} is less than alpha {},".format(p_value_onetail, alpha))
    print("reject the null hypothesis that average charges for smokers are less than or equal to those of non-smokers.")
else:
    print("Conclusion: since the one-tail P-value {} is greater than alpha {},".format(p_value_onetail, alpha))
    print("fail to reject the null hypothesis.")
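The standard `ttest_ind` assumes equal group variances, which is questionable here because the smokers' charges are far more spread out than the non-smokers'. Welch's variant drops that assumption. A sketch with synthetic stand-ins for the two charge samples (the means and spreads below are illustrative, not taken from the data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# synthetic stand-ins: higher mean and higher spread for the "smoker" group
charges_yes = rng.normal(32000, 11000, 274)
charges_no = rng.normal(8400, 6000, 1064)

# equal_var=False selects Welch's t-test (no equal-variance assumption)
t_w, p_w = stats.ttest_ind(charges_yes, charges_no, equal_var=False)
print(t_w > 0, p_w / 2 < 0.05)  # one-tailed p, valid since t_w > 0
```

Recent SciPy versions also accept `alternative='greater'` on `ttest_ind`, which returns the one-tailed p-value directly instead of halving the two-tailed one.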
Aim: To prove that the BMI of females is different from that of males
Null hypothesis: there is no difference between the BMI of males and the BMI of females.
Alternative hypothesis: there is a difference between the BMI of males and the BMI of females.
df_male=df.loc[df.sex=="male"]
df_female=df.loc[df.sex=="female"]
bmi_female=df_female.bmi
bmi_male=df_male.bmi
print(df_male.bmi.mean(),df_female.bmi.mean())
sns.histplot(bmi_male, color='green', kde=True)
sns.histplot(bmi_female, color='red', kde=True)
plt.show()
sns.kdeplot(bmi_male, color='green')
sns.kdeplot(bmi_female, color='red')
plt.show()
t_statistic2, p_value2 = stats.ttest_ind(bmi_male, bmi_female)
print("t-statistic = ",t_statistic2, ", p-value = ", p_value2)
if p_value2 < alpha:
    print("Conclusion: since P-value {} is less than alpha {},".format(p_value2, alpha))
    print("reject the null hypothesis that there is no difference between the BMI of males and females.")
else:
    print("Conclusion: since P-value {} is greater than alpha {},".format(p_value2, alpha))
    print("fail to reject the null hypothesis that there is no difference between the BMI of males and females.")
correlation_values = df.corr(numeric_only=True)  # only numeric columns enter the correlation matrix
correlation_values
sns.set()
figure, axes = plt.subplots(figsize=(11, 8))
sns.heatmap(correlation_values, linewidths=0.5, ax=axes, cmap='Reds')
plt.show()
A darker red in the correlation heatmap indicates a stronger correlation between the corresponding variables.
import sklearn
sklearn.__version__
df[['charges', 'region']].groupby(['region']).agg(['min', 'max', 'mean'])
df.head()
df.sex.unique()
df['sex'] = df['sex'].replace(('female', 'male'), (1, 2))
df.head(3)
#Encoding categorical values for smoking status
df['smoker'] = df['smoker'].replace(('yes', 'no'), (2, 1))
df.head(3)
df.region.unique()
# Assumption: the southeast region has the highest average charges, so encode southeast = 2 and the other regions = 1
df['region'] = df['region'].replace(('southeast', 'southwest', 'northwest', 'northeast'), (2, 1, 1, 1))
df.head(3)
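The manual replace-based encoding above imposes an ordinal assumption (e.g. southeast = 2 > others = 1). An alternative, shown here on a hypothetical mini-frame mirroring the categorical insurance columns, is one-hot encoding with `pd.get_dummies`, which creates one binary column per category:

```python
import pandas as pd

# hypothetical mini-frame mirroring the categorical insurance columns
toy = pd.DataFrame({
    'sex': ['female', 'male', 'male'],
    'smoker': ['yes', 'no', 'yes'],
    'region': ['southeast', 'northwest', 'southwest'],
})
# drop_first removes one redundant level per column (the dropped level becomes the baseline)
encoded = pd.get_dummies(toy, columns=['sex', 'smoker', 'region'], drop_first=True)
print(encoded.columns.tolist())
```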
y = df['charges']
x = df.drop(['charges'], axis = 1)
print(x.shape)
print(x.columns)
# The aim here is to train on the full dataset; train_test_split does not accept
# test_size=0, so we split first and then recombine the pieces below.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
#print(type(x_train))
temp1=[x_train,x_test]
temp2=[y_train,y_test]
x_train=pd.concat(temp1)
y_train=pd.concat(temp2)
x_test = pd.DataFrame(columns=['age', 'sex', 'bmi', 'children', 'smoker', 'region'])
y_test = pd.Series([], name="charges", dtype=float)
print('Size of x_train = ', x_train.shape)
print('Size of x_test = ', x_test.shape)
print('Size of y_train = ', y_train.shape)
print('Size of y_test = ', y_test.shape)
#sc = StandardScaler()
#x_train = sc.fit_transform(x_train)
#x_test = sc.transform(x_test)
model = LinearRegression()
model.fit(x_train, y_train)
y_predict = model.predict(x_train)
y_predict[:5]
all_v_charges_r2score = r2_score(y_train, y_predict)
all_v_charges_r2score
The R2 score is 0.7501.
This means the linear regression model explains about 75% of the variance of the target variable in terms of the independent variables ('age', 'sex', 'bmi', 'children', 'smoker', 'region').
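Since plain R² can only increase as predictors are added, the adjusted R² is a useful companion figure. A small helper using the standard formula, evaluated with this notebook's figures (1338 rows, 6 predictors):

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2: penalises R^2 for the number of predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

print(adjusted_r2(0.7501, 1338, 6))  # barely below 0.7501, since n >> p
```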
df.columns
df.head()
x = df[['age', 'sex', 'bmi', 'children', 'smoker', 'region']]
y = df['charges']
x2 = sm.add_constant(x)
est = sm.OLS(y, x2)
est2 = est.fit()
print(est2.summary())
Several coefficients have p-values that are very low (effectively zero), indicating a strong association between those predictors and the target (charges).
The R² value is 0.75, so about 75% of the variability of charges is explained by the current regression model.
Given the dataset, this is a satisfactory result.
sns.scatterplot(x=est2.fittedvalues, y=est2.resid);
plt.title("OLS Regression: Fitted values and residuals")
plt.show()
The plot above is a scatterplot of the fitted values against the residuals of the OLS regression.
The dataset "50-startups-data-regression.csv" contains data about 50 startup companies.
This dataset was collected from startups in New York, California and Florida (roughly 17 per state).
The variables used in the dataset are Profit, R&D spending, Administration Spending, and Marketing Spending.
df1=pd.read_csv('https://raw.githubusercontent.com/Nidhish-Krishna/ds_basics/main/50-startups-data-regression.csv')
df1.info()
df1.describe()
correlation_values = df1.corr(numeric_only=True)  # only numeric columns enter the correlation matrix
correlation_values
sns.set()
figure, axes = plt.subplots(figsize=(11, 8))
sns.heatmap(correlation_values, linewidths=0.5, ax=axes, cmap='Reds')
plt.show()
x = df1[['R&D Spend']].values.reshape(-1,1)
y = df1['Profit'].values.reshape(-1,1)
randd_v_profit_linear=LinearRegression()
randd_v_profit_linear.fit(x, y)
print("The linear model is: Y = {:.3} + {:.3}X".format(randd_v_profit_linear.intercept_[0],randd_v_profit_linear.coef_[0][0]))
print(randd_v_profit_linear.intercept_[0],randd_v_profit_linear.coef_)
predictions =randd_v_profit_linear.predict(x)
predictions[:5]
plt.figure(figsize=(16, 8))
plt.scatter(
df1[['R&D Spend']],
df1['Profit'],
c='black'
)
plt.plot(
df1[['R&D Spend']],
predictions,
c='blue',
linewidth=2
)
plt.title("Linear Regression: R&D Spend v Profit")
plt.xlabel("R&D Spend")
plt.ylabel("Profit")
plt.show()
sns.residplot(x = "R&D Spend",
y = "Profit",
data = df1,
lowess = True,
color='green')
plt.title("Residual plot: R&D Spend v Profit")
plt.show()
randd_v_profit_r2score = r2_score(y, predictions)
randd_v_profit_r2score
x = df1['R&D Spend']
y = df1['Profit']
x2 = sm.add_constant(x)
est = sm.OLS(y, x2)
est2 = est.fit()
print(est2.summary())
Looking at the coefficients, the p-values are very low (although probably not exactly 0). This indicates a strong relationship between R&D spending and the target (profit).
Then, looking at the R² value, we have 0.947. Therefore, about 94.7% of the variability of profit is explained by the R&D Expenditure.
df1.info()
x = df1[['Administration']].values.reshape(-1,1)
y = df1['Profit'].values.reshape(-1,1)
admin_v_profit_linear=LinearRegression()
admin_v_profit_linear.fit(x,y)
print("The linear model is: Y = {:.3} + {:.3}X".format(admin_v_profit_linear.intercept_[0], admin_v_profit_linear.coef_[0][0]))
predictions = admin_v_profit_linear.predict(x)
predictions[:5]
plt.figure(figsize=(16, 8))
plt.scatter(
df1[['Administration']],
df1['Profit'],
c='black'
)
plt.plot(
df1[['Administration']],
predictions,
c='blue',
linewidth=2
)
plt.title("Linear Regression: Administration v Profit")
plt.xlabel("Administration")
plt.ylabel("Profit")
plt.show()
sns.residplot(x = "Administration",
y = "Profit",
data = df1,
lowess = True)
plt.title("Residual Plot: Administration v Profit")
plt.show()
admin_v_profit_r2score = r2_score(y, predictions)
admin_v_profit_r2score
x = df1['Administration']
y = df1['Profit']
x2 = sm.add_constant(x)
est = sm.OLS(y, x2)
est2 = est.fit()
print(est2.summary())
Here the evidence for a relationship is weak: the R² value is only about 0.04, so roughly 4% of the variability of profit is explained by Administration expenditure. This is clearly not a good single-variable model.
df1.info()
x = df1[['Marketing Spend']].values.reshape(-1,1)
y = df1['Profit'].values.reshape(-1,1)
marketing_v_profit_linear=LinearRegression()
marketing_v_profit_linear.fit(x,y)
print("The linear model is: Y = {:.3} + {:.3}X".format(marketing_v_profit_linear.intercept_[0], marketing_v_profit_linear.coef_[0][0]))
predictions = marketing_v_profit_linear.predict(x)
predictions[:5]
# R2 for the marketing model, computed from its own predictions
marketing_v_profit_r2score = r2_score(y, predictions)
marketing_v_profit_r2score
plt.figure(figsize=(16, 8))
plt.scatter(
df1[['Marketing Spend']],
df1['Profit'],
c='black'
)
plt.plot(
df1[['Marketing Spend']],
predictions,
c='blue',
linewidth=2
)
plt.title("Linear Regression: Marketing Expenditure v Profit")
plt.xlabel("Marketing Spend")
plt.ylabel("Profit")
plt.show()
sns.residplot(x = "Marketing Spend",
y = "Profit",
data = df1,
lowess = True)
plt.title("Residual Plot: Marketing Expenditure v Profit")
plt.show()
marketing_v_profit_r2score = r2_score(y, predictions)
marketing_v_profit_r2score
x = df1['Marketing Spend']
y = df1['Profit']
x2 = sm.add_constant(x)
est = sm.OLS(y, x2)
est2 = est.fit()
print(est2.summary())
Looking at the coefficients, the p-value is very low (although probably not exactly 0). This indicates a strong relationship between marketing spending and the target (profit).
Then, looking at the R² value, we have 0.559. Therefore, about 55.9% of the variability of profit is explained by the Marketing Expenditure.
# Assuming state has no effect on profit - the analysis is based solely on the expenditure columns
df1.drop(['State'], axis=1,inplace=True)
df1.head()
x = df1[['R&D Spend','Administration','Marketing Spend']]
y = df1['Profit']
df1.columns
# As before, we train on the full dataset; train_test_split does not accept
# test_size=0, so we split first and then recombine the pieces below.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# #print(type(x_train))
temp1=[x_train,x_test]
temp2=[y_train,y_test]
x_train=pd.concat(temp1)
y_train=pd.concat(temp2)
x_test = pd.DataFrame(columns=['R&D Spend', 'Administration', 'Marketing Spend'])
y_test = pd.Series([], name="Profit", dtype=float)
print('Size of x_train = ', x_train.shape)
print('Size of x_test = ', x_test.shape)
print('Size of y_train = ', y_train.shape)
print('Size of y_test = ', y_test.shape)
all_v_profit_linear=LinearRegression()
all_v_profit_linear.fit(x_train, y_train)
print("The linear model is: Y = (",all_v_profit_linear.intercept_, ") + (",all_v_profit_linear.coef_[0],
") X1 + (",all_v_profit_linear.coef_[1], ") X2 + (",all_v_profit_linear.coef_[2], ") X3")
predictions = all_v_profit_linear.predict(x_train)
predictions[:5]
all_v_profit_r2score = r2_score(y_train, predictions)
all_v_profit_r2score
x = df1[['R&D Spend','Administration','Marketing Spend']]
y = df1['Profit']
x2 = sm.add_constant(x)
est = sm.OLS(y, x2)
est2 = est.fit()
print(est2.summary())
Looking at the coefficients, the p-values for some predictors (notably R&D Spend) are very low, indicating a strong relationship with the target (profit).
Then, looking at the R² value, we have 0.951. Therefore, about 95.1% of the variability of profit is explained by all the three coefficients taken together.
list_of_models = ["All Parameters", "R&D Spend", "Administration", "Marketing Spend"]
list_of_r2scores = [all_v_profit_r2score, randd_v_profit_r2score, admin_v_profit_r2score, marketing_v_profit_r2score]
plt.bar(list_of_models, list_of_r2scores)
plt.title("Models vs R2 Score")
plt.xlabel("Models")
plt.ylabel("R2 Score")
plt.show()
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score,classification_report
from sklearn import tree
The dataset "social-network-advertisements-data.csv" contains information used to determine whether a user purchased a particular product.
dataset = pd.read_csv('https://raw.githubusercontent.com/Nidhish-Krishna/ds_basics/main/social-network-advertisements-data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
dataset.shape
dataset.info()
dataset.describe()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
classifier1 = GaussianNB()
classifier1.fit(X_train, y_train)
print(classifier1.predict(sc.transform([[30,87000]])))
y_pred = classifier1.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1)[:12])
from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier1.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Naive Bayes (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier1.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Naive Bayes (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
classifier2 = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier2.fit(X_train, y_train)
print(classifier2.predict(sc.transform([[30,87000]])))
y_pred = classifier2.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1)[:12])
from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier2.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Decision Tree (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier2.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Decision Tree (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
classifier3 = LogisticRegression(random_state = 0)
classifier3.fit(X_train, y_train)
print(classifier3.predict(sc.transform([[30,87000]])))
y_pred = classifier3.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1)[:12])
from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier3.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier3.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
log_model = LogisticRegression(random_state = 0)
nb_model = GaussianNB()
#svc_model = SVC()
des_model = DecisionTreeClassifier(criterion="entropy")
# X = dataset.iloc[:, :-1].values
# y = dataset.iloc[:, -1].values
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
models = [
    {'label': 'Logistic Regression', 'model': log_model},
    {'label': 'Naive Bayes', 'model': nb_model},
    {'label': 'Decision Tree Classification', 'model': des_model},
]
from sklearn.metrics import roc_curve, roc_auc_score, auc
plt.clf()
plt.figure(figsize=(8,6))
for m in models:
    fitted = m['model'].fit(X_train, y_train)
    probas = fitted.predict_proba(X_test)  # all three classifiers provide predict_proba
    fpr, tpr, thresholds = roc_curve(y_test, probas[:, 1])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % (m['label'], roc_auc))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=0, fontsize='small')
plt.show()
From the ROC plots we can infer that Naive Bayes is the best of the three classifiers, as it has the greatest area under the curve (AUC = 0.96).
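As a cross-check, `roc_auc_score` computes the same AUC directly from the predicted probabilities, without building the curve first. A self-contained sketch on synthetic data (the notebook's classifiers would plug in the same way):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve, auc, roc_auc_score

# synthetic two-feature binary problem standing in for the advertising data
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probas = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, probas)
print(auc(fpr, tpr), roc_auc_score(y_te, probas))  # the two values agree
```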
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier3, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier2, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier1, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
10-fold cross-validation of all the models suggests that Naive Bayes has the highest accuracy of the three (87.67%).
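The three near-identical cross-validation cells above can be collapsed into one loop. A sketch on a synthetic dataset (the notebook would pass its own X_train, y_train instead):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
results = {}
for name, clf in [('Logistic Regression', LogisticRegression(max_iter=1000)),
                  ('Naive Bayes', GaussianNB()),
                  ('Decision Tree', DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold CV accuracy per model
    results[name] = scores.mean()
    print('{}: {:.2f} % (+/- {:.2f} %)'.format(name, scores.mean() * 100, scores.std() * 100))
```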
The dataset "mall-customers-dataset.csv" contains information about mall customers (CustomerID, Genre/Gender, Age, Annual Income) and their spending scores.
Using clustering algorithms, we try to identify distinct classes of customers in the dataset.
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
from pyclustertend import hopkins
dataset1 = pd.read_csv('https://raw.githubusercontent.com/Nidhish-Krishna/ds_basics/main/mall-customers-dataset.csv')
X1 = dataset1.iloc[:, [3, 4]]  # Annual Income and Spending Score columns
X=X1.values
dataset1.head()
dataset1.info()
type(X)
Y = scale(X)
arr = np.zeros(5000)
for i in range(5000):
    arr[i] = hopkins(Y, 200)
print(np.mean(arr))
In the pyclustertend convention, a Hopkins score near 0.5 means the data is essentially uniformly random, so clustering is not meaningful.
Only when the score is clearly away from 0.5 (for this implementation, well below it) does the dataset show a clustering tendency.
The averaged Hopkins score here is about 0.30, which is clearly below 0.5, so clustering is possible for the given dataset.
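To make the statistic concrete, here is a minimal sketch of a Hopkins-style computation (following the low-means-clustered convention used above): it compares nearest-neighbour distances within the data against distances from uniformly random points to the data. The function name `hopkins_statistic` and the synthetic datasets are assumptions for illustration, not pyclustertend's implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, m, rng):
    """Hopkins-style score: ~0.5 for uniformly random data,
    near 0 for data with cluster structure (low-is-clustered convention)."""
    n, d = X.shape
    sample = X[rng.choice(n, m, replace=False)]
    # uniform random points inside the data's bounding box
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # distance from each sampled real point to its nearest *other* real point
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]
    # distance from each uniform point to its nearest real point
    u = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]
    return w.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(0)
# two tight blobs: strong cluster structure
clustered = np.vstack([rng.normal(0, 0.1, (100, 2)),
                       rng.normal(5, 0.1, (100, 2))])
# uniform noise: no cluster structure
random_data = rng.uniform(0, 5, (200, 2))
h_clustered = hopkins_statistic(clustered, 50, rng)
h_random = hopkins_statistic(random_data, 50, rng)
print(h_clustered, h_random)
```

For the clustered data, within-data distances `w` are tiny relative to `u`, driving the ratio toward 0; for uniform data, `w` and `u` are comparable and the ratio sits near 0.5.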
# Using the elbow method to find the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
The WCSS curve flattens noticeably after k = 5; the rate of decrease drops sharply there.
So the elbow method indicates k = 5 as a good choice for the number of clusters in k-means.
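Reading the elbow off the plot can also be automated with a simple heuristic: the bend is where the second difference of the WCSS curve is largest. This is a sketch on synthetic three-blob data (the data and the heuristic are assumptions, not part of the analysis above).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# three well-separated blobs, so the true elbow is at k = 3
Xb, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [10, 0]],
                   cluster_std=0.4, random_state=0)
wcss = [KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
        .fit(Xb).inertia_ for k in range(1, 11)]
# the second difference peaks where the curve bends most sharply
second_diff = np.diff(wcss, 2)
elbow_k = int(np.argmax(second_diff)) + 2  # +2: diffs[0] corresponds to k = 2
print(elbow_k)  # expect 3 for three well-separated blobs
```

This heuristic works well for a clear elbow; for gentler curves, distance-to-chord methods (e.g. the "kneedle" approach) are more robust.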
# Training the K-Means model on the dataset
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
labels=pd.DataFrame(kmeans.labels_)
clustered_data=X1.assign(Cluster=labels)
grouped_data=clustered_data.groupby(['Cluster']).mean().round(1)
grouped_data
from sklearn.metrics import silhouette_score
r = range(2, 10)
res = []
for ele in r:
    cl = KMeans(n_clusters = ele, init = 'k-means++', random_state = 42)
    y_pred = cl.fit_predict(Y)  # use the model built for this cluster count
    silhouette_avg = silhouette_score(Y, y_pred)
    res.append([ele, silhouette_avg])
results = pd.DataFrame(res, columns=['n-clusters', 'silhouette-score'])
pivot_km = pd.pivot_table(results, index='n-clusters', values='silhouette-score')
plt.figure()
sns.heatmap(pivot_km, annot=True, linewidths=0.5, fmt='.3f', cmap=sns.cm.rocket_r)
plt.tight_layout()
The heatmap shows the average silhouette score for each candidate number of clusters; the value of k with the highest score is the best-supported choice for k-means.
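Silhouette-based selection of k can be demonstrated end to end on synthetic data, where the best k is known in advance. The three-blob dataset here is an assumption for illustration, not the mall-customer data.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# three well-separated blobs, so the silhouette score should peak at k = 3
Xb, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [12, 0]],
                   cluster_std=0.7, random_state=42)
sil = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10,
                    random_state=42).fit_predict(Xb)
    sil[k] = silhouette_score(Xb, labels)
best_k = max(sil, key=sil.get)
print(best_k)  # expect 3
```

Unlike the elbow method, the silhouette score gives an absolute quality measure per k (between -1 and 1), so the maximum can be picked directly rather than judged visually.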
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
# Using the dendrogram to find the optimal number of clusters
plt.figure(figsize=(16,8))
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()
The optimal number of clusters is found from the dendrogram in hierarchical clustering.
It is the number of vertical lines passing through the largest possible gap between any two parallel horizontal lines in the dendrogram plot.
Here, the largest such gap lies between Euclidean distances of roughly 100 and 250, and 5 vertical lines pass through it.
So we choose 5 clusters for agglomerative clustering.
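The "largest gap" rule above can also be applied programmatically: the merge heights in the linkage matrix are monotonically increasing for Ward linkage, so cutting the tree in the biggest gap between successive merge heights yields the suggested cluster count. This sketch uses synthetic three-blob data (an assumption for illustration).

```python
import numpy as np
from scipy.cluster import hierarchy
from sklearn.datasets import make_blobs

# three well-separated blobs, so the gap rule should suggest 3 clusters
Xb, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5], [10, 0]],
                   cluster_std=0.5, random_state=42)
Z = hierarchy.linkage(Xb, method='ward')
heights = Z[:, 2]                     # merge heights, ascending for ward
i = int(np.argmax(np.diff(heights)))  # largest gap between successive merges
threshold = (heights[i] + heights[i + 1]) / 2
labels = hierarchy.fcluster(Z, t=threshold, criterion='distance')
n_found = len(np.unique(labels))
print(n_found)  # expect 3
```

Cutting at `threshold` keeps every merge below the gap and breaks every merge above it, which is exactly the horizontal cut through the tallest clear span of the dendrogram.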
hc = AgglomerativeClustering(n_clusters = 5, linkage = 'ward')  # Euclidean distance is the default (and required) metric for ward linkage
y_hc = hc.fit_predict(X)
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
labels=pd.DataFrame(hc.labels_)
clustered_data=X1.assign(Cluster=labels)
grouped_data=clustered_data.groupby(['Cluster']).mean().round(1)
grouped_data
from sklearn.metrics import silhouette_score
r = range(2, 12)
res = []
for ele in r:
    hc = AgglomerativeClustering(n_clusters = ele, linkage = 'ward')  # vary the cluster count
    y_pred = hc.fit_predict(X)
    silhouette_avg = silhouette_score(X, y_pred)  # score on the same data that was clustered
    res.append([ele, silhouette_avg])
results = pd.DataFrame(res, columns=['n-clusters', 'silhouette-score'])
pivot_hc = pd.pivot_table(results, index='n-clusters', values='silhouette-score')
plt.figure()
sns.heatmap(pivot_hc, annot=True, linewidths=0.5, fmt='.3f', cmap=sns.cm.rocket_r)
plt.tight_layout()
As with k-means, the heatmap shows the average silhouette score for each number of clusters in hierarchical clustering; the k with the highest score is the best-supported choice.